Multimodal learning
General-purpose neural networks capable of handling diverse input modalities and output tasks
See:
Resources
- Multimodal Deep Learning
- https://paperswithcode.com/methods/category/vision-and-language-pre-trained-models
- Vision Language models: towards multi-modal deep learning
Code
- #CODE Pykale - Knowledge-Aware machine LEarning (KALE): accessible machine learning from multiple sources for interdisciplinary research
- #CODE Unilm - Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
Courses
Books
- #BOOK Multimodal Deep Learning (Akkus 2023)
- https://slds-lmu.github.io/seminar_multimodal_dl/index.html
References
- #PAPER Multi-modal Transformer for Video Retrieval (Gabeur 2020)
- #PAPER #REVIEW Recent Advances and Trends in Multimodal Deep Learning: A Review (Summaira 2021)
- #PAPER Perceiver: General Perception with Iterative Attention (Jaegle 2021)
- https://www.zdnet.com/article/googles-supermodel-deepmind-perceiver-is-a-step-on-the-road-to-an-ai-machine-that-could-process-everything/
	- Multi-modal model handling images, audio, video, and 3D point clouds
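The core idea of the Perceiver can be sketched in a few lines: a small learned latent array queries the (possibly huge) flattened input via cross-attention, repeatedly, so cost scales with latent size times input size rather than quadratically in the input. This is a minimal numpy sketch of that bottleneck only (no learned projections, heads, or MLP blocks, all of which the real model has); array sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latents, inputs):
    # latents: (N, D) small learned array; inputs: (M, D) flattened modality bytes.
    # Queries come from the latents, keys/values from the inputs, so the score
    # matrix is (N, M) instead of the (M, M) of full self-attention.
    d = latents.shape[-1]
    scores = latents @ inputs.T / np.sqrt(d)  # (N, M)
    return softmax(scores) @ inputs           # (N, D)

rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 16))    # latent bottleneck, N=8 (illustrative)
inputs = rng.normal(size=(1000, 16))  # e.g. flattened image/audio array, M=1000

# "iterative attention": refine the latents against the same inputs several times
for _ in range(3):
    latents = cross_attention(latents, inputs)

print(latents.shape)  # (8, 16)
```

Because the output shape is fixed by the latent array, the same loop works unchanged whether `inputs` holds image pixels, audio samples, or point-cloud features.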
- #PAPER PyKale: Knowledge-Aware Machine Learning from Multiple Sources in Python (Lu 2021)
- #PAPER Perceiver IO: A General Architecture for Structured Inputs & Outputs (Jaegle 2021)
- #PAPER VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text (Akbari 2021)
- #CODE https://paperswithcode.com/paper/vatt-transformers-for-multimodal-self
- VATT is trained to learn multimodal representations from unlabeled data using Transformer architectures
- #PAPER NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion (Wu 2021)
- #CODE https://paperswithcode.com/paper/nuwa-visual-synthesis-pre-training-for-neural
- Paper explained
	- NÜWA consists of an adaptive encoder that takes either text or visual input, and a pre-trained decoder shared by 8 visual tasks
	- A 3D Nearby Attention mechanism (3DNA) is proposed to reduce computational complexity and improve visual quality by exploiting the locality of visual data along both the spatial and temporal axes
- #PAPER data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language (Baevski 2022)
- #PAPER A Generalist Agent (Reed 2022)
- Paper explained
	- A new approach, inspired by large-scale language models, that acts as a single generalist agent. The agent, called Gato, is built to work as a multi-modal, multi-task, multi-embodiment generalist policy
- #PAPER Towards artificial general intelligence via a multimodal foundation model (Fei 2022)
- #PAPER Language Models are General-Purpose Interfaces (Hao 2022)
- #PAPER NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis (Wu 2022)